Abstract. In this paper we address the challenge of assessing the quality 
of Wikipedia pages using scores derived from edit contribution and con- 
tributor authoritativeness measures. The hypothesis is that pages with 
significant contributions from authoritative contributors are likely to 
be high-quality pages. Contributions are quantified using edit longevity 
measures and contributor authoritativeness is scored using centrality 
metrics in either the Wikipedia talk or co-author networks. The results 
suggest that it is useful to take into account the contributor authori- 
tativeness when assessing the information quality of Wikipedia content. 
The percentile visualization of the quality scores provides some insights 
about the anomalous articles, and can be used to help Wikipedia editors 
to identify Start and Stub articles that are of relatively good quality. 



1 Introduction 

In recent years, the world has witnessed an exponential growth of User-Generated
Content (UGC) applications. Among these applications, Wikipedia is the most
successful, aiming to harness the contributions of millions of individuals
to build a free collaborative encyclopedia. Wikipedia attracts
millions of visits every day, and has become one of the most widely used sources
of information on the web. On the other hand, Wikipedia articles are constantly
changing, and the contributors range from casual visitors to professionals and
dedicated editors. When visitors access Wikipedia content linked through search
engines, they are presented with the latest version of the article but have
no idea how much the content can be relied upon. Despite its great success as a
means of knowledge sharing and collaboration, it is still difficult for visitors to
develop an informed opinion about the reliability of much of the content available
on Wikipedia. These issues have generated increasing interest in
assessing the trustworthiness of Wikipedia content [1,2,3,4].

Through an extensive study of the co-author network of the English Wikipedia
community, Laniado and Tasso find that a nucleus of very active contributors, who
seem to spread over the whole wiki, tend to interact preferentially with the less
experienced users [3]. This finding is supported by the growing centrality of
the very active contributors in the co-author network. These dedicated editors
play a fundamental role in the Wikipedia community by spreading
knowledge, information and experience across the whole wiki. In this paper, we
explore the idea of assessing the quality of Wikipedia articles by leveraging
scores derived from edit contributions and contributor authoritativeness
metrics. Edit contributions are quantified using an edit longevity measure, and
contributor authoritativeness is scored using network centrality metrics in either the
Wikipedia talk or co-author networks. While the former captures author
contributions recorded in the complete edit history of the articles, the latter measures
contributor authoritativeness, which encodes the communication patterns in the
Wikipedia networks. The intuition is that articles with significant contributions
from authoritative contributors are likely to be of high quality, and that
high-quality articles generally involve more communication and interaction between
authors. By incorporating this information into the assessment of the quality of
Wikipedia articles, we expect to develop a better strategy to assess the quality
of Wikipedia content.

In the next section we provide a brief review of some relevant studies on the 
assessment of information quality for Wikipedia content and network analysis 
of Wikipedia. Then in Section 3 we introduce the edit longevity metric used to 
calculate the contribution of each author to a page, while in Section 4 we describe 
the centrality metrics that we use to measure contributor authoritativeness. Our 
models used to assess the quality of Wikipedia articles are presented in Section 
5, and an evaluation of the models is provided in Section 6. The paper concludes 
in Section 7 with some discussions and an outline of future work. 

2 Related work 

Recently, researchers have shown an increased interest in measuring the quality
or trustworthiness of UGC, and of Wikipedia content in particular. For instance,
Adler et al. propose to make use of a trust quality metric (i.e., author reputation
based on edit longevity) to measure the reliability of Wikipedia content [3,12].
Hu et al. propose several models to assess the quality of Wikipedia articles and
contributor authority based on the assumption that good contributors usually
contribute good articles and good articles are contributed by good authors [3].
However, they neglect the fact that people who have high authority and
knowledge may only possess that for a specific domain. Moturu and Liu evaluate the
trustworthiness of social media content using feature categories identified from
sociological theory, and adopt unsupervised trust scoring models to combine
these features [B]. Unlike previous studies, Kane [T] performs a quantitative
study of the collaborative features associated with 188 similar high-quality
Wikipedia articles in an attempt to better understand the mechanism behind
the success of peer-produced collaboration in wiki environments. In the study,
the author examines the relationship between the quality of Wikipedia articles
and the four collaborative features associated with each article. The quantitative
analysis suggests that different sets of features have different influence on the
quality of peer-produced information. Later, Kane and Ransbotham [7] study
the relationship between collaboration patterns and the quality of Wikipedia
articles by using ordinal regression. In their study, collaboration patterns are
measured in terms of the number of distinct contributors to each article and the degree
centrality and eigenvector centrality of each article in the two-mode network.
Through experiments on 16,068 articles from the Medicine WikiProject, the authors show
that there is a recursive, positive correlation between quality and collaboration
on Wikipedia articles [7]. Wu et al. [5] propose to characterize the quality of
Wikipedia articles solely using network motif profiles, and demonstrate that the
network motif-based characterization can be used to distinguish good from ordinary
quality articles with reasonable accuracy. However, to our knowledge, very few
researchers propose to assess the quality of Wikipedia articles by combining the
influence of authors in the network and their edit contributions derived from the
edit history of the articles.

Korfiatis et al. [9] propose a network-based approach to evaluate authoritative 
sources in Wikipedia by using the centrality metrics from a two-mode network 
of articles and contributors. They evaluate their quality measure on a small 
dataset consisting of ten articles in the "Philosophy" domain from the English 
Wikipedia, and suggest that it could be useful to utilize the social network 
measures to evaluate the authoritativeness of content found in Wikipedia and 
similar sources. This study is similar in spirit to the strategy in our work as 
centrality metrics are used to measure contributor authoritativeness. 

There have been many quantitative studies on Wikipedia content to measure
author contribution, using criteria such as the number of edits performed by authors
(e.g., [10,11]), the total number of words introduced by contributors [11], and text
survival and edit distance [3,12]. Adler et al. propose a set of metrics and efficient
algorithms to compute author contributions, and show that edit longevity is a
good indicator of author contribution [12]. While the two widely used criteria,
edit count and text count, are naive and easy to compute, they fail to capture the
size or the quality of the contributions. In contrast, edit longevity takes into
account the amount of edits performed by an author and the survival of these edits
in the subsequent revisions. In this work, we adopt the edit longevity metric
[3,12] to measure author contributions to a wiki page, both for its accuracy and
its efficient computation using the open source WikiTrust software.

Some researchers have studied the co-author network of Wikipedia, aiming
to find patterns of collaboration and cooperation in the process of Wikipedia
content creation. These studies are mainly based on the simple assumption that
two users editing the same page is enough to establish a co-authorship (e.g., [13]),
and usually fail to scale to the size of Wikipedia in major languages such as
English. Laniado and Tasso [5] utilize edit longevity to compute a score that
evaluates the contribution of each contributor to each wiki page, then select the main
contributors for the page according to the scores; the co-author network is then
constructed based on the selected main contributors for each page. We employ
this approach to generate the co-author network for the English Wikipedia.

To facilitate direct communication between Wikipedians, the MediaWiki software
assigns every registered user a user page and a user talk page (UTP).
Like Wikipedia articles, these pages can be edited by anyone. Any user
can leave another user a message by editing their UTP; the owner can choose to
reply to a message on her own UTP or on the desired receiver's UTP. Massa [14]
extracts the communication network for Venetian Wikipedia users by reading 
and coding the messages left on UTP conversations. Specifically, for each message 
written by user A on user B's talk page, the two users were added as nodes 
to the network and a corresponding edge from A to B was created, with the 
weight of the edge representing the number of messages A wrote to B. Analyzing 
the social network of Wikipedia may provide a deeper insight into the social 
dynamics of Wikipedia, and reveal communication patterns (i.e., the flow of 
knowledge, experience and Wikipedia rules) among Wikipedia users [13]. We 
chose to rely on the open source wiki-network software released by Massa to
construct the talk network for the English Wikipedia. This software provides two
algorithms to generate the talk network from UTPs, i.e., signature2graph.py
and utpedits2graph.py. The former (the signature algorithm) generates the talk
network by parsing and counting signatures on the current version of UTPs in the
current data dump, while the latter (the history algorithm) builds the talk network
from the complete dump. Since the history
algorithm extracts the talk network from the whole edit history of UTPs, it
generally captures more communications and interactions among the users. As
stated by the author in [14], since existing signatures are not updated when a
user is renamed, the signature algorithm usually fails to detect renamed users, while
the history algorithm is not affected by this issue. In this work, we consider
and compare the centrality metrics from the two talk networks in assessing the
quality of Wikipedia articles.
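
The edge construction described above (one directed edge from A to B, weighted by the number of messages A wrote on B's talk page) can be illustrated with a minimal sketch. It is not the wiki-network code: the input format, the function name `build_talk_network`, and the choice to skip messages a user leaves on their own page are assumptions made for illustration only.

```python
from collections import Counter


def build_talk_network(messages):
    """Build a weighted, directed talk network.

    `messages` is an iterable of (sender, utp_owner) pairs (hypothetical
    input format), each meaning that `sender` wrote one message on the
    user talk page of `utp_owner`.
    Returns (nodes, edges) where edges maps (A, B) to the number of
    messages A wrote to B.
    """
    nodes = set()
    edges = Counter()
    for sender, owner in messages:
        if sender == owner:
            # Assumption: a message on one's own talk page creates no edge.
            continue
        nodes.update((sender, owner))
        edges[(sender, owner)] += 1
    return nodes, edges


messages = [("alice", "bob"), ("alice", "bob"), ("bob", "alice"), ("carol", "carol")]
nodes, edges = build_talk_network(messages)
```

In this toy input, the edge from alice to bob carries weight 2 and carol never enters the network, because her only message is on her own page.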

3 Edit Longevity 

The computation of edit longevity can be summarized as follows [12]. Suppose
we have a wiki page $p$ with $n > 0$ versions $v_0, v_1, v_2, \ldots, v_n$, where the initial
version $v_0$ is empty and the $i$-th version $v_i$ ($i \in [1, n]$) is obtained by an author
performing a revision $r_i: v_{i-1} \rightarrow v_i$. Since each revision is performed by only one
author, we denote the author who performed revision $r_i$ as $a_i$. The edit contribution
made in revision $r_i$ is defined as $d(r_i) = d(v_{i-1}, v_i)$. The edit distance between two
versions, $v_i$ and $v_j$, is computed by

$$d(v_i, v_j) = \max(I, D) - \frac{1}{2}\min(I, D) + M \tag{1}$$

where $I(v_i, v_j)$ is the number of words that are inserted, $D(v_i, v_j)$ represents
the number of words that are deleted, and $M(v_i, v_j)$ denotes the number of words
that are moved, times the fraction of the document that they move across. The
details of this definition can be found in [3].
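
As a small worked illustration of this definition, following Adler et al.'s formula $d = \max(I, D) - \frac{1}{2}\min(I, D) + M$, the sketch below assumes that the three quantities have already been extracted by a text-differencing tool. The function name and the example numbers are hypothetical.

```python
def edit_distance(inserted, deleted, moved):
    """Edit distance between two versions.

    inserted: number of words inserted (I)
    deleted:  number of words deleted (D)
    moved:    number of words moved, already weighted by the fraction of
              the document they move across (M)
    """
    return max(inserted, deleted) - 0.5 * min(inserted, deleted) + moved


# 10 words inserted, 4 deleted, weighted move term of 1.5:
# max = 10, half of min = 2, so the distance is 10 - 2 + 1.5
example_distance = edit_distance(10, 4, 1.5)
```

The half weight on $\min(I, D)$ means that replacing a word (one deletion plus one insertion) counts less than two unrelated changes.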

The edit distance between $v_i$ and $v_j$ is a quantity measure: it evaluates
how much change (measured in terms of word additions, deletions, replacements,
displacements, etc.) has taken place from $v_i$ to $v_j$. The quality (also
termed longevity) of the edits performed in revision $r_i$ of a page (corresponding
to $v_i$) is defined as follows [3,12]:

$$\alpha_{edit}(v_i, v_j) = \frac{d(v_{i-1}, v_j) - d(v_i, v_j)}{d(v_{i-1}, v_i)} \tag{2}$$

The focus here is on the quality of the edit that brought the page from $v_{i-1}$ to
$v_i$. The objective is to assess how much of that edit has survived
in version $v_j$. Adler et al. take special care to make sure that the edit distance
$d(v_i, v_j)$ satisfies the triangle inequality, so that $\alpha_{edit}(v_i, v_j)$ takes values from
$-1$ to $+1$. $\alpha_{edit}(v_i, v_j) = -1$ means that the revision performed by $a_i$ is completely
reverted by the following authors, while $\alpha_{edit}(v_i, v_j) = +1$ indicates that the edit
contribution made by $a_i$ in revision $r_i$ is completely preserved by other authors [3,12].
When the value of $\alpha_{edit}(v_i, v_j)$ does not fall into $[-1, +1]$, we trim it to the
nearer of these two values.
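
The edit quality and its trimming can be sketched as follows, taking the three edit distances as given. The function name is hypothetical, and the handling of a zero-size revision (returning a neutral value of 0) is an assumption of this sketch, not something stated in the text.

```python
def edit_quality(d_prev_to_judge, d_cur_to_judge, d_prev_to_cur):
    """Edit quality of revision r_i as judged by a later version v_j.

    d_prev_to_judge: d(v_{i-1}, v_j)
    d_cur_to_judge:  d(v_i, v_j)
    d_prev_to_cur:   d(v_{i-1}, v_i), the size of the edit itself
    The result is trimmed to the interval [-1, +1].
    """
    if d_prev_to_cur == 0:
        # Assumption: an edit that changes nothing gets a neutral score.
        return 0.0
    alpha = (d_prev_to_judge - d_cur_to_judge) / d_prev_to_cur
    return max(-1.0, min(1.0, alpha))


# Edit fully preserved: v_j equals v_i, so d(v_i, v_j) = 0.
preserved = edit_quality(10.0, 0.0, 10.0)
# Edit fully reverted: v_j equals v_{i-1}, so d(v_{i-1}, v_j) = 0.
reverted = edit_quality(0.0, 10.0, 10.0)
```

The two calls reproduce the extreme cases discussed above: a preserved edit scores $+1$ and a reverted edit scores $-1$.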

Due to the frequent vandalism that happens in the wiki, it is a good idea to judge
the edit quality using several succeeding revisions. Let us denote $J_{r_i}$ as the set
of the first ten versions after $v_i$ that have authors different from that of $r_i$. For
$J_{r_i} \neq \emptyset$, the average edit quality $\bar{\alpha}_{edit}(v_i, v_j)$ of $r_i$ is defined as follows [12]:

$$\bar{\alpha}_{edit}(v_i, v_j) = \frac{1}{|J_{r_i}|} \sum_{v_j \in J_{r_i}} \alpha_{edit}(v_i, v_j) \tag{3}$$




The edit longevity is computed by combining the size of the edit performed
by an author and the longevity of the edit in the following revisions. Thus, the
edit longevity of a revision $r$ made by an author $a$ can be defined as:

$$\mathrm{EditLong}(r) = \bar{\alpha}_{edit}(v_i, v_j) \cdot d(r) \tag{4}$$

Similar to the work in [S], we are only concerned with the positive
contribution made by each author to a page, and neglect revisions with a negative
score. We denote $E_{a,p}$ as the set of edits performed by author $a$ on page $p$; then
the contribution of author $a$ to page $p$ can be computed by accumulating the
edit longevity over all revisions (i.e., edits) performed by this author as follows:

$$\mathrm{contribution}(a, p) = \sum_{e \in E_{a,p} \mid \mathrm{EditLong}(e) > 0} \mathrm{EditLong}(e) \tag{5}$$
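
The averaging over judging versions, the edit longevity, and the accumulation of positive scores can be put together in a short sketch. The data layout (each revision given as a list of per-judge edit qualities plus an edit size) and the function names are assumptions for illustration. Giving no credit to a revision that has no judging versions yet is also an assumption.

```python
def edit_longevity(judge_qualities, edit_size):
    """Average edit quality over the judging versions, times the edit size d(r)."""
    if not judge_qualities:
        # Assumption: a revision with no later judges earns no credit.
        return 0.0
    average_quality = sum(judge_qualities) / len(judge_qualities)
    return average_quality * edit_size


def contribution(revisions):
    """Sum of the positive edit longevities of one author's revisions on one page.

    `revisions` is a list of (judge_qualities, edit_size) pairs.
    """
    total = 0.0
    for judge_qualities, edit_size in revisions:
        longevity = edit_longevity(judge_qualities, edit_size)
        if longevity > 0:
            total += longevity
    return total


revisions = [
    ([1.0, 0.5], 10.0),    # mostly preserved: 0.75 * 10 = 7.5
    ([-1.0, -0.5], 4.0),   # mostly reverted: negative, ignored
    ([0.25], 8.0),         # partly preserved: 0.25 * 8 = 2.0
]
author_contribution = contribution(revisions)
```

Only the first and third revisions count towards the author's contribution, since the second has a negative longevity.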

We select the main contributors for each page using a strategy similar to [5]:
as anonymous users in Wikipedia do not have reliable nicknames, we discard
all anonymous contributions. Let us denote the set of all registered users who
edited page $p$ as $U_p$, ordered by descending edit longevity. The set of authors for
page $p$ is selected as the subset $A_p \subset U_p$ including the first users of $U_p$ that
make at least $\theta$ percent of the edit contribution to $p$, i.e.,

$$\frac{\sum_{a \in A_p \mid \mathrm{contribution}(a,p) \geq M,\ |A_p| \geq K} \mathrm{contribution}(a, p)}{\sum_{u \in U_p} \mathrm{contribution}(u, p)} \geq \theta \tag{6}$$

where $M$ is the minimum threshold for author contribution, $K$ is the minimum
number of users considered as authors of $p$, and $\theta \in [0, 100]$ is the percentage
contribution threshold. Considering that Wikipedia is the result of peer-produced
collaboration, we impose the $|A_p| \geq K$ constraint on author selection so that
we can select more authors even for those pages whose edit contributions are
dominated by a few contributors.
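
One possible reading of this selection rule is sketched below. The text does not fully specify how the three thresholds interact, so the following choices are assumptions: users are taken in descending order of contribution, users below the minimum contribution are never selected, selection stops once the covered share reaches the threshold and at least the minimum number of authors has been picked, and the threshold is expressed as a fraction (0.9 rather than 90). All names are hypothetical.

```python
def select_authors(contribs, theta=0.9, min_contrib=10.0, min_authors=20):
    """Pick the main authors of a page.

    contribs: dict mapping each registered user to contribution(a, p).
    theta: share of the total contribution to cover (fraction, assumed).
    min_contrib: minimum contribution M for a user to be selectable.
    min_authors: minimum number of authors K, when enough users qualify.
    """
    total = sum(contribs.values())
    ranked = sorted(contribs.items(), key=lambda item: item[1], reverse=True)
    selected = []
    covered = 0.0
    for user, value in ranked:
        if value < min_contrib:
            break  # every remaining user is below the minimum contribution
        share_reached = total > 0 and covered / total >= theta
        if share_reached and len(selected) >= min_authors:
            break  # enough contribution covered and enough authors picked
        selected.append(user)
        covered += value
    return selected


contribs = {"a": 60.0, "b": 25.0, "c": 10.0, "d": 4.0, "e": 1.0}
two_authors = select_authors(contribs, theta=0.8, min_contrib=5.0, min_authors=2)
three_authors = select_authors(contribs, theta=0.8, min_contrib=5.0, min_authors=3)
```

With a minimum of two authors, users a and b already cover 85% of the contribution and selection stops. Raising the minimum to three adds user c even though the share threshold was already met, which is the effect the constraint on the number of authors is meant to have.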



4 Author Centrality 

In this work, we aim to measure contributor authoritativeness using centrality
measured in either the talk network or the co-author network of Wikipedia.
In the 1940s, Bavelas introduced the idea of correlating the position of an
individual in a social network with the relative influence or importance of that
individual in the network for organizational communication [15]. Since then,
many centrality metrics (e.g., degree, betweenness, eigenvector, etc.) have been
proposed to investigate the influence and special properties of individuals in a
network (e.g., [16,17]). In this work, we adopt three widely used centrality
measures to assess contributor authoritativeness: degree, betweenness and
eigenvector centrality.

The degree centrality of a node measures its potential communication activity
or participation across the network, and is defined as the number of nodes
adjacent to it [17]. The betweenness centrality of a node is defined as the
proportion of all shortest paths between other pairs of nodes that pass through
that node [17]. The betweenness centrality of node $n$ is computed as [17]:

$$\mathrm{betweenness}(n) = \sum_{i,j} \frac{|p_{i,n,j}|}{|p_{i,j}|} \tag{7}$$



where, for each pair of nodes $i$ and $j$ in the network, $p_{i,j}$ are all the shortest paths
linking $i$ and $j$, and $p_{i,n,j}$ are the ones that pass through node $n$. The intuition
is that the larger the betweenness of a node, the more influence it
has on the information flow in the whole network, i.e., betweenness centrality
is an index of the potential for control of communication [5,17]. In the case
of Wikipedia, nodes with higher betweenness scores make the propagation of
information and knowledge easier.
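
For a small undirected, unweighted network, the betweenness definition can be evaluated directly by counting shortest paths with breadth-first search. The sketch below is a brute-force illustration, not the Gephi implementation used in the experiments. It sums over unordered pairs of nodes, which is an assumption about the intended convention, and it expects the adjacency dictionary to be symmetric.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return (dist, sigma): shortest-path lengths and path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]  # every shortest path to u extends to w
    return dist, sigma


def betweenness(adj):
    """Sum, over unordered pairs (i, j), of the share of shortest i-j paths through n."""
    info = {v: bfs_counts(adj, v) for v in adj}
    scores = {v: 0.0 for v in adj}
    for i, j in combinations(sorted(adj), 2):
        dist_i, sigma_i = info[i]
        if j not in dist_i:
            continue  # i and j are not connected
        for n in adj:
            if n == i or n == j or n not in dist_i:
                continue
            dist_n, sigma_n = info[n]
            # n lies on a shortest i-j path iff the distances add up exactly.
            if j in dist_n and dist_i[n] + dist_n[j] == dist_i[j]:
                scores[n] += sigma_i[n] * sigma_n[j] / sigma_i[j]
    return scores


# Path graph a - b - c - d
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
path_scores = betweenness(path)
```

On the path graph, the two inner nodes each lie on the shortest paths of two pairs, while the end nodes lie on none.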

Eigenvector centrality is a measure of influence or popularity for nodes based
on the adjacency matrix of the network [18]. It takes into account a wide range of
direct and indirect influences in the global network. The eigenvector centrality
of individuals in the network is the eigenvector of the adjacency matrix that
corresponds to the largest eigenvalue $\lambda$ [18]. Google's PageRank [19] measure is
a variant of eigenvector centrality; we also make use of the PageRank variant to
evaluate contributor authoritativeness for authors of Wikipedia articles.
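
The PageRank variant can be approximated by power iteration, as in the minimal sketch below. The damping factor of 0.85 is the conventional default rather than a value given in the text, the redistribution of rank from nodes without outgoing links is one common convention, and the function name and toy graph are hypothetical. It is not the implementation used by Gephi.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    out_links: dict mapping every node to the list of nodes it links to
    (every link target must also appear as a key).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Toy directed network: a cycle a -> b -> c -> a, plus d pointing at a.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
toy_rank = pagerank(toy_links)
```

In the toy network, node a receives links from both c and d and therefore ends up with the highest score, while d, which nobody links to, keeps only the baseline share.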

5 Quality Measures 

In this work, we employ three models to assess the quality of Wikipedia pages:
an edit contribution-based model, a centrality-based model, and the
combination of edit contribution and contributor authoritativeness. Our basic model is
designed based on the principle that "the higher the edit contribution of the
article becomes, the better quality is the article". This model measures the quality
of an article by aggregating the edit longevity of all its author contributions, and
is defined as follows:

$$\mathrm{Bas\_QScore}(p) = \sum_{a \in A_p} \mathrm{contribution}(a, p) \tag{8}$$

Similarly, our centrality-based model is based on the principle that "the
higher the authority of the authors in a specific domain, the better quality is
the article". This model measures the quality of an article by the aggregation of
authorities (i.e., centrality) from all its authors, and is defined as follows:

$$\mathrm{Cen\_QScore}(p) = \sum_{a \in A_p} \mathrm{centrality}(a) \tag{9}$$

While the basic model captures the author contributions recorded in the edit
history of the articles, the centrality-based models mainly consider the
contributor authoritativeness that encodes the communication patterns in the Wikipedia
networks. In contrast with the two previously mentioned models, a more sophisticated
way of assessing articles is to combine the edit longevity measure with
contributor authoritativeness. The intuition is that each author plays a different
role in the network (measured in terms of centrality), and by nature some authors are
more influential than others in the network. By incorporating this information
in measuring author contribution to a page, we expect to develop a better
strategy to assess the quality of Wikipedia pages. For each author $a$ of page $p$,
its contribution to $p$ can be computed as follows:

$$\mathrm{AuthorScore}(a, p) = \mathrm{contribution}(a, p) \cdot \mathrm{centrality}(a) \tag{10}$$

It is worth noting that the values of contribution(a, p) and centrality(a) for
different authors can have different scales, so it is a good idea to normalize all these
measures. To facilitate the analysis in the following section, we denote the complex
models derived from the combination of edit contribution with PageRank,
eigenvector, degree and betweenness-based centrality as Com_QScore_PR, Com_QScore_eigen,
Com_QScore_degree and Com_QScore_btw, respectively. The score for page
$p$ can then be obtained by summing over the contributions of the set of its authors
$A_p$ as follows:

$$\mathrm{Com\_QScore}(p) = \sum_{a \in A_p} \mathrm{AuthorScore}(a, p) \tag{11}$$
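
The three page-level scores of this section can be summarized in a short sketch. The text recommends normalizing contribution and centrality values but does not say how, so the min-max scaling shown here is only one possible choice, and the function names and example values are hypothetical.

```python
def normalize(values):
    """Min-max scale a dict of values to [0, 1] (one possible normalization)."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        return {key: 1.0 for key in values}
    return {key: (value - low) / (high - low) for key, value in values.items()}


def bas_qscore(contribs):
    """Basic model: sum of the authors' contributions to the page."""
    return sum(contribs.values())


def cen_qscore(contribs, centrality):
    """Centrality-based model: sum of the centralities of the page's authors."""
    return sum(centrality[author] for author in contribs)


def com_qscore(contribs, centrality):
    """Combined model: sum of contribution times centrality over the authors."""
    return sum(value * centrality[author] for author, value in contribs.items())


# Contributions of the selected authors of one page, and network-wide centralities.
page_contribs = {"a": 0.5, "b": 0.25}
centrality = {"a": 1.0, "b": 0.5, "c": 0.0}

basic = bas_qscore(page_contribs)
central = cen_qscore(page_contribs, centrality)
combined = com_qscore(page_contribs, centrality)
scaled = normalize({"a": 10.0, "b": 30.0, "c": 20.0})
```

In the combined model, author b's contribution is discounted by a lower centrality, so the page score depends on who made the surviving edits as well as on how much survived.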

6 Evaluation 

In this section, we provide an experimental evaluation of our approach on a
dataset that consists of 9290 history-related articles from the WikiProject
History³ that fall under the FA(164), A(6), GA(312), B(906), C(900), Start(4072)
and Stub(2930) classes⁴. The dataset was generated by extracting the complete
edit history of the 9290 articles from the complete XML dump of 2012/02/11⁵.
After collecting the dataset, we computed the edit contributions of all authors
(i.e., edit longevity) using the WikiTrust software. Regarding author selection,
we set the percentage contribution threshold θ=0.9, the minimum number of authors
K=20, and the minimum edit contribution M=10⁶. We generated two talk
network versions for the WikiProject History, TalkNetwork(signature) and
TalkNetwork(utpedits), by removing users who do not participate in the project from the
two complete talk networks for the English Wikipedia. The difference between
these two talk networks is explained at the end of Section 2. The co-author
network for the project is also generated. The statistics for the networks are shown
in Table 1. To obtain contributor authoritativeness, we performed centrality analysis
on the talk networks and the co-author network using the Gephi software.



Table 1: Statistics for the networks with M=10, K=20 and θ=0.9

                                     With bots                      Without bots
Networks                  #Authors   #nodes    #edges     #Authors   #nodes    #edges
Coauthor Network            18,844   18,844   628,524       19,606   19,606   712,685
TalkNetwork(signature)               14,728    29,813                15,301    30,656
TalkNetwork(utpedits)                17,034   704,248                17,700   723,088



We employ the Normalized Discounted Cumulative Gain (NDCG) metric
[20] to evaluate how well our quality measurement models rank the
Wikipedia articles. The NDCG metric was introduced by Järvelin et al. [20] to
measure the ability of a document retrieval algorithm to rank entries that are
more relevant to the query. This metric has been used by other researchers to
3 http://en.wikipedia.org/wiki/Wikipedia:WikiProject_History

4 More detail on the Wikipedia quality classes can be found at
http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Assessment.

5 Available at http://dumps.wikimedia.org/enwiki/20120211/

6 Varying the values of the parameters did not produce remarkable differences in
the experimental results.

evaluate the quality or trustworthiness of Wikipedia content [4,6] because it is
suitable for ranking entries with multiple levels of assessment (e.g., Wikipedia
quality ratings with FA > A > GA > B > C > Start > Stub class labels). In the case
of Wikipedia, we expect articles that are more useful or more trustworthy
to be ranked highly. NDCG is calculated as follows:

$$\mathrm{NDCG} = \frac{1}{Z} \sum_{r=1}^{k} \frac{2^{s(r)} - 1}{\log(1 + r)} \tag{12}$$

where $Z$ is a normalization factor calculated so that a perfect ranking of the
top $k$ articles would yield an NDCG of 1, and $s(r)$ denotes the score given to
the article ranked at position $r$. In our case, we set different scores for different
classes: $s(r) = 6$ for a Featured article, $s(r) = 5$ for an A-class article and so on down
to $s(r) = 1$ for a Start-class article and $s(r) = 0$ for a Stub-class article, indicating that
a Stub-class article at position $r$ does not contribute to the cumulative gain.
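
The NDCG computation with these class scores can be sketched as follows. The sketch uses the widely used gain $2^{s(r)} - 1$ with a logarithmic position discount. The base-2 logarithm is an assumption, but it does not affect the result because the same discount appears in the normalization factor. The dictionary and function names are hypothetical.

```python
import math

# Class scores as described in the text: FA = 6 down to Stub = 0.
GAIN = {"FA": 6, "A": 5, "GA": 4, "B": 3, "C": 2, "Start": 1, "Stub": 0}


def dcg(labels, k):
    """Discounted cumulative gain of the first k quality labels of a ranking."""
    return sum(
        (2 ** GAIN[label] - 1) / math.log2(1 + position)
        for position, label in enumerate(labels[:k], start=1)
    )


def ndcg(ranked_labels, k=None):
    """NDCG of a ranking, given the quality class of each ranked article."""
    if k is None:
        k = len(ranked_labels)
    ideal = sorted(ranked_labels, key=GAIN.get, reverse=True)
    z = dcg(ideal, k)  # normalization factor Z: DCG of a perfect ranking
    return dcg(ranked_labels, k) / z if z > 0 else 0.0


perfect = ndcg(["FA", "GA", "Stub"])
swapped = ndcg(["Stub", "FA"])
```

A ranking that already follows the class order scores 1, while placing a Stub above a Featured article halves the position weight of the only article that carries any gain.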



6.1 Evaluation on All History-related Articles 

Table 2: NDCG scores for the whole Wikipedia History dataset

                      CoauthorNetwork       TalkNet(signature)    TalkNet(utpedits)
Models                w/o bots  with bots   w/o bots  with bots   w/o bots  with bots
PageRank-based          0.73      0.74        0.73      0.73        0.75      0.75
Eigenvector-based       0.75      0.74        0.75      0.75        0.76      0.76
Degree-based            0.74      0.74        0.75      0.75        0.76      0.76
Betweenness-based       0.73      0.73        0.75      0.75        0.74      0.74
Basic model             0.78      0.78        0.78      0.78        0.78      0.78
Com_QScore_PR           0.77      0.77        0.77      0.77        0.79      0.79
Com_QScore_eigen        0.77      0.77        0.78      0.78        0.79      0.79
Com_QScore_degree       0.77      0.77        0.78      0.78        0.78      0.78
Com_QScore_btw          0.76      0.77        0.78      0.78        0.78      0.78

In this group of experiments, we evaluate our models using contributor
authoritativeness derived from the three networks, first including and then excluding bot
users. Table 2 presents the NDCG performance of our models on this dataset.
Whether bots are excluded or included makes little
difference to the NDCG performance of the models on the three networks.
The basic model performs better than the centrality-based
models in terms of NDCG score (0.78 versus 0.73 to 0.76). Regarding the complex
models, on CoauthorNetwork and TalkNetwork(signature) their NDCG performance
is very close to that of the basic model, while on TalkNetwork(utpedits) the
complex models perform slightly better than the basic model (with an
improvement of about 1% for Com_QScore_PR and Com_QScore_eigen). The NDCG
performance of the models using contributor authoritativeness
from TalkNetwork(utpedits) is also slightly better than that of the corresponding
models on CoauthorNetwork and TalkNetwork(signature). This indicates that



10 Xiangju Qin and Padraig Cunningham 



the TalkNetwork(utpedits) is more informative (in terms of capturing the
communications and interactions among authors) than the other two networks. By
inspecting the talk networks, we notice that signature2graph.py fails to
extract some active contributors (e.g., Borsoka⁷, Ptolemy Caesarion⁸,
CarlosPn⁹) when generating the network, which makes TalkNetwork(signature)
smaller and less accurate. Overall, combining edit contribution and contributor
authoritativeness to assess the quality of Wikipedia articles improves the NDCG
performance to some extent, which implies that it is beneficial to take
contributor authoritativeness (i.e., communication patterns) into consideration when
assessing the information quality of Wikipedia content. One explanation for this
may be that articles with significant contributions from authoritative
contributors are likely to be of high quality, and that high-quality articles generally
involve more communication and interaction between authors.





Fig. 1: Percentile distribution of Wikipedia quality scores, using contributor
authoritativeness from TalkNetwork(utpedits) without considering bot users.



To further understand the differences between these models, we also provide
another view of the results from this experiment. In Fig. 1 we plot the percentile
distribution of Wikipedia quality scores (ordered by descending quality scores)

7 http://en.wikipedia.org/wiki/User_talk:Borsoka

8 http://en.wikipedia.org/wiki/User_talk:Ptolemy_Caesarion

9 http://en.wikipedia.org/wiki/User_talk:CarlosPn



Assessing the Quality of Wikipedia Pages 






for four models. The proportion of articles from a certain class that fall into a
percentile is represented by a suitably sized, colored bubble. We observe that
there is a slight difference between the quality scores of the basic model and
the complex model, with the latter ranking slightly more Featured articles in the top 90
percentile than the former. All four models rank many Featured, A-class
and Good articles very highly in the top 90 percentile; the majority of Good, B and
C-class articles are ranked in the top 80 percentile; and the largest share
of Start and Stub-class articles are ranked in the lower percentiles. There is
clearly some overlap between the classes, which indicates that no major
difference exists between Featured and Good articles or between B and C-class
articles. Overall, the percentile distributions of quality scores for the four models
are very similar, with the quality scores for all the classes except Featured and
A-class articles distributed across the whole percentile range.
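
The data behind such a percentile plot can be derived as in the sketch below: articles are ranked by descending quality score, the ranking is cut into equal-sized bins, and for each class the proportion of its articles falling into each bin is reported. The exact binning used for the figure is not specified in the text, so the ten equal rank bins and all names here are assumptions.

```python
from collections import Counter, defaultdict


def percentile_distribution(articles, bins=10):
    """Per-class proportions of articles in each rank bin.

    articles: list of (quality_score, quality_class) pairs.
    Returns {quality_class: [proportion in bin 0, ..., bin bins-1]},
    where bin 0 holds the highest-scoring articles.
    """
    ranked = sorted(articles, key=lambda article: article[0], reverse=True)
    class_totals = Counter(label for _, label in ranked)
    counts = defaultdict(lambda: [0] * bins)
    for position, (_, label) in enumerate(ranked):
        bin_index = position * bins // len(ranked)
        counts[label][bin_index] += 1
    return {
        label: [count / class_totals[label] for count in row]
        for label, row in counts.items()
    }


toy_articles = [(0.9, "FA"), (0.8, "FA"), (0.4, "Stub"), (0.1, "Stub")]
toy_distribution = percentile_distribution(toy_articles, bins=2)
```

Each proportion would then determine the size of one bubble, so overlapping classes show up as bubbles of different colors in the same bin.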

It is worth noting that a small number of B, C, Start
and even Stub-class articles are ranked very highly and appear in the top 90
percentile. By investigating these anomalously ranked articles, we find that the
highly ranked B or C-class articles have generally been demoted from FA or GA status
to their current status, or are of good quality due to massive edits by Wikipedia
users (e.g., History of classical music traditions¹⁰, Stone Age¹¹). The anomalous
Start-class articles have mainly either been demoted from a higher quality
status (e.g., Monroe Doctrine¹² and Battle of Stirling Bridge¹³, demoted from
FA status to Start-class) or are of relatively good quality and will be promoted
to a higher quality status once issues such as the need for additional citations for
verification are resolved (e.g., at the time of writing, History of Mexico¹⁴ was
rerated from Start-class to C-class, and History of Scotland¹⁵ was rerated from
Start-class to B-class). The small number of highly ranked Stub articles are due
to similar reasons, such as being shortened from a longer version of the article
which misused sources (e.g., Mathematics in medieval Islam¹⁶). Basically, these
highly ranked articles usually involve some controversial issues which bring about
edit wars among the contributors.

On the other hand, some Good, B and C-class articles are ranked in the
lower 10 percentile. We find that these articles are rated as low importance by
WikiProjects, and are generally short and concise articles that involve the
participation of very few authors (e.g., John of Argyll¹⁷). Wu et al. [5] observe
a similar trend when they visualize the network motif profiles of Wikipedia
articles. To summarize, the percentile visualization of the quality scores provides
some insights for analyzing the anomalous or outlier articles in the datasets,


10 http://en.wikipedia.org/wiki/Talk:History_of_classical_music_traditions

11 http://en.wikipedia.org/wiki/Stone_Age

12 http://en.wikipedia.org/wiki/Monroe_Doctrine

13 http://en.wikipedia.org/wiki/Battle_of_Stirling_Bridge

14 http://en.wikipedia.org/wiki/Talk:History_of_Mexico

15 http://en.wikipedia.org/wiki/History_of_Scotland

16 http://en.wikipedia.org/wiki/Mathematics_in_medieval_Islam

17 http://en.wikipedia.org/wiki/John_of_Argyll

and can be used to help Wikipedia editors identify Start and Stub articles
that are of relatively good quality.

It is also worth noting that there are some differences between the percentile 
distribution of the quality scores we present here and the direct distribution 
of Wikipedia trust quality scores by Moturu and Liu [5], with regard to the 
type of distribution and the overlap between different classes. Moturu and 
Liu plot the distribution according to the proportion of the normalized trust 
scores from a certain class that falls into each of the 11 Trust Score Categories 
(ranging from 0 to 10), whereas we plot the percentile distribution of the scores 
based on the descending ranking of the quality scores. While our results contain some 
overlap between Good, B, C and Start-class articles, their results contain 
less overlap between different classes. We have explained the reasons for 
our anomalies above. A further reason for the difference may be that Moturu 
and Liu evaluate their models on a very small dataset of 230 health-related 
Wikipedia articles, while we evaluate our models on a larger dataset of 9290 
history-related Wikipedia articles and do not filter out the anomalous or 
controversial articles. 
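
The percentile view described above can be illustrated with a minimal sketch. It assumes that each article has a single quality score and a class label, and that the percentile of an article is taken from its position in the descending ranking of all scores; the function names and the default of ten bands are illustrative choices rather than part of the original method.

```python
from collections import Counter, defaultdict


def percentile_bands(scores, classes, n_bands=10):
    """Count, per class, how many articles fall into each percentile band.

    Articles are ranked by descending score. Band 0 holds the top-ranked
    articles and band n_bands - 1 the lowest-ranked ones.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    counts = defaultdict(Counter)
    for rank, i in enumerate(order):
        band = rank * n_bands // n  # always in 0 .. n_bands - 1
        counts[classes[i]][band] += 1
    return counts


def band_proportions(counts, n_bands=10):
    """Convert per-class band counts into per-class proportions."""
    proportions = {}
    for cls, per_band in counts.items():
        total = sum(per_band.values())
        proportions[cls] = [per_band[b] / total for b in range(n_bands)]
    return proportions


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the paper's dataset.
    toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]
    toy_classes = ["FA", "FA", "Start", "Stub", "Start", "Stub"]
    counts = percentile_bands(toy_scores, toy_classes, n_bands=2)
    print(band_proportions(counts, n_bands=2))
```

A Start or Stub article that lands in one of the top bands would be a candidate for the kind of anomaly discussed above.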

6.2 Evaluation on Filtered Datasets 

In the results presented in Fig. 1, it is not easy to get a clear picture of the 
relative performance of the different methods primarily because of the poor 
separation between the classes. In order to present a clearer picture we performed 
a further evaluation on some filtered datasets where there is clearer separation 
of the classes. This is consistent with evaluations performed by other researchers 
(e.g., |BJ, [5]). We use contributor authoritativeness from TalkNetwork(utpedits) 
without considering bot users; the results are shown in Table 3. The distribution 
of the filtered dataset is: FA(164), C(439), Start(413), Stub(279). 

Table 3: NDCG performance on filtered datasets with fewer classes. 

Classes            Basic Model   PageRank-based   Com^Q Score pr 
FA-C-Start-Stub    0.806         0.778            0.837 
FA-C               0.808         0.783            0.837 
FA-Start-Stub      0.983         0.965            0.990 
FA-Start           0.984         0.968            0.991 
FA-Stub            0.995         0.992            0.996 


As expected, we observe an improvement in NDCG with the removal of A-class, 
Good, and B-class articles and the reduction of the number of C, Start 
and Stub-class articles. One reason for this is that eliminating or reducing 
these classes, which are very close to the Featured, C, Start and Stub-class 
articles, gives clearer class separation. Moreover, the fact that removing 
Start and Stub articles yields little improvement in NDCG means that they are 
easy to distinguish from Featured and C-class articles; however, there is clear 
overlap between Featured and C-class articles. Finally, as with Moturu and 
Liu [5], we observe a maximum NDCG close to 1 (0.996) when only the FA and Stub 
classes are considered. 
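
For readers who wish to reproduce this kind of comparison, the following is a minimal sketch of an NDCG computation over a ranked list of articles. It assumes the common exponential-gain, log2-discount form of DCG and a hypothetical mapping from quality class to gain value; the exact formulation and gain values used in our experiments are not restated here, so the numbers it yields are illustrative only.

```python
import math

# Hypothetical gain per quality class (an assumption for illustration).
GAIN = {"FA": 3, "C": 2, "Start": 1, "Stub": 0}


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg(scores, labels, gain=GAIN):
    """NDCG of the ranking induced by sorting articles by descending score."""
    ranked = [gain[label] for _, label in
              sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    ideal_dcg = dcg(sorted(ranked, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def ndcg_on_classes(scores, labels, keep, gain=GAIN):
    """Evaluate NDCG on a filtered dataset that keeps only the given classes."""
    pairs = [(s, l) for s, l in zip(scores, labels) if l in keep]
    return ndcg([s for s, _ in pairs], [l for _, l in pairs], gain)


if __name__ == "__main__":
    # Toy values for illustration only.
    toy_scores = [0.5, 0.9, 0.3, 0.1]
    toy_labels = ["FA", "C", "Start", "Stub"]
    print(ndcg(toy_scores, toy_labels))
    print(ndcg_on_classes(toy_scores, toy_labels, keep={"FA", "Stub"}))
```

In the toy example, the C-class article outranks the Featured article, so the four-class NDCG falls below 1, while the FA-Stub subset is ranked perfectly.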
Fig. 2: Precision-recall curves on filtered dataset. 



We also test on a binary dataset, which allows us to perform a precision-recall 
analysis, with which other researchers will be more familiar than NDCG 
analysis. The distribution of this dataset is: FA(164), A(6), GA(313), Start(414), 
Stub(280). We regard Featured, A-class and Good articles as relevant to the target 
query, and the remaining classes as irrelevant entries; the curves are shown in Fig. 2. 
We observe from Fig. 2 that the curve closest to the upper right-hand corner 
of the graph corresponds to the complex model, which indicates the 
best performance among the three models. The curve of the basic model lies 
in the middle of the three curves, with the curve of the PageRank-based model 
lying slightly below those of the other two models. The precision-recall curves further 
support the view that it is useful to take into account the influence of users in the 
network of Wikipedia when assessing the quality of Wikipedia content. 
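
The binary evaluation above can be sketched as follows, assuming that articles are ranked by descending quality score and that precision and recall are computed at every cut-off in that ranking. The function name and the toy data are illustrative assumptions; only the choice of relevant classes (Featured, A-class and Good) comes from the text.

```python
# Classes treated as relevant in the binary dataset (FA, A, GA).
RELEVANT = {"FA", "A", "GA"}


def precision_recall_points(scores, labels, relevant=RELEVANT):
    """Return (recall, precision) at each rank of the descending-score ranking."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_relevant = sum(1 for _, label in ranked if label in relevant)
    if total_relevant == 0:
        raise ValueError("no relevant articles in the dataset")
    points = []
    hits = 0
    for k, (_, label) in enumerate(ranked, start=1):
        if label in relevant:
            hits += 1
        points.append((hits / total_relevant, hits / k))
    return points


if __name__ == "__main__":
    # Toy values for illustration only.
    toy_scores = [0.9, 0.8, 0.4, 0.1]
    toy_labels = ["FA", "Start", "GA", "Stub"]
    for recall, precision in precision_recall_points(toy_scores, toy_labels):
        print(f"recall={recall:.2f} precision={precision:.2f}")
```

A model whose points stay closer to the upper right-hand corner keeps precision high as recall grows, which is the criterion used to compare the three curves.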

7 Conclusions and Future Work 

In this paper, we study models for assessing the quality of Wikipedia pages based 
on edit contribution and contributor authoritativeness metrics. Edit contributions 
are quantified using edit longevity measures, and contributor authoritativeness 
is scored using network centrality metrics in either the Wikipedia talk or 
co-author networks. We evaluate our quality measurement models on a dataset 
that consists of 9290 history-related articles from the WikiProject History. 

The results suggest that it is useful to take into account the contributor 
authoritativeness (i.e., the centrality metrics of the contributors in the Wikipedia 
networks) when assessing the information quality of Wikipedia content. The 
implication of this is that articles with significant contributions from authoritative 
contributors are likely to be of high quality, and that high-quality articles 
generally involve more communication and interaction between contributors. The 
percentile visualization of the quality scores provides some insights about the 
outlier articles, and can be used to help Wikipedia editors to identify Start and 
Stub articles that are of relatively good quality. 

At present, we do not have any special treatment to deal with reverted edits 
that do not introduce new content to a page. On some occasions, the edit longevity 
of users who reverted edits by malicious users to the last version dominates the 
edit contribution of that page. For instance, user HappyCamper made only one 
edit to Battle of Stirling Bridge (footnote 13), reverting the edit from an anonymous user 
(who rewrote the content of the whole page in a malicious way) to its last version. 
However, the edit longevity of this author for the page is 97995.8, which accounts 
for about 92% of the total edit contribution for the page. This is a misleading 
assessment of the author's contribution to that page and in turn distorts any 
resulting quality scores that use this assessment. 
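
The distortion described above can be checked with a short sketch that computes each author's share of a page's total edit longevity and flags authors who dominate the page with a single edit. The 97995.8 figure and the approximate 92% share come from the example above; the remaining longevity (implied by that share), the edit counts, the author keys and the 0.5 threshold are hypothetical values chosen for illustration.

```python
def contribution_shares(longevity_by_author):
    """Return each author's fraction of the page's total edit longevity."""
    total = sum(longevity_by_author.values())
    if total == 0:
        return {author: 0.0 for author in longevity_by_author}
    return {author: value / total for author, value in longevity_by_author.items()}


def dominant_single_edit_authors(longevity_by_author, edit_counts, share_threshold=0.5):
    """Flag authors whose one and only edit holds at least share_threshold of the total."""
    shares = contribution_shares(longevity_by_author)
    return [author for author, share in shares.items()
            if share >= share_threshold and edit_counts.get(author, 0) == 1]


if __name__ == "__main__":
    longevity = {
        "reverting_user": 97995.8,      # value reported in the text
        "all_other_authors": 8521.4,    # hypothetical remainder implied by the ~92% share
    }
    edits = {"reverting_user": 1, "all_other_authors": 40}  # hypothetical edit counts
    print(contribution_shares(longevity))
    print(dominant_single_edit_authors(longevity, edits))
```

Authors flagged in this way are candidates for the special treatment of reverted edits that the current models lack.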

In this study, we impose the |A_p| > K constraint on author selection so that, 
to compensate for this, we can select more authors even for those pages whose edit 
contributions are dominated by a few contributors. In future work we will 
take steps to deal with reverted edits in the analysis, as they can produce a false 
impression of author contribution. We noticed from our evaluation that there 
are some anomalous situations resulting from the reorganization of Wikipedia 
pages; for instance, an article may be considerably shortened by moving material 
to another page or a new page. Therefore, we have to be more careful when 
using the WikiTrust software to measure author contribution to Wikipedia articles, 
because in these cases the edit contribution obtained from the software may 
not be a real reflection of the author contribution. At present, we neglect the 
temporal information in the edit history of the Wikipedia pages and UTPs; it 
will be interesting to evaluate how the quality of Wikipedia pages changes 
over time. 
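
The effect of the |A_p| > K constraint can be illustrated with a minimal sketch. It assumes, purely for illustration, that authors of a page are selected in descending order of edit contribution until a given fraction of the total contribution is covered; the coverage rule, the parameter names and the toy values are assumptions, and only the requirement that more than K authors be selected comes from the text.

```python
def select_authors(longevity_by_author, coverage=0.9, k=3):
    """Select top contributors of a page, keeping more than k authors where possible.

    Authors are added in descending order of contribution until `coverage`
    of the total is reached (an assumed rule), but selection does not stop
    before more than k authors have been chosen.
    """
    ranked = sorted(longevity_by_author.items(), key=lambda p: p[1], reverse=True)
    total = sum(value for _, value in ranked)
    selected = []
    covered = 0.0
    for author, value in ranked:
        if covered >= coverage * total and len(selected) > k:
            break
        selected.append(author)
        covered += value
    return selected


if __name__ == "__main__":
    # Toy page dominated by a single contributor (hypothetical values).
    page = {"r": 92.0, "a": 4.0, "b": 2.0, "c": 1.0, "d": 1.0}
    print(select_authors(page, coverage=0.9, k=0))  # coverage rule alone
    print(select_authors(page, coverage=0.9, k=3))  # with the size constraint
```

With the constraint in place, a page dominated by one reverting user still contributes several authors to the quality score, which dampens the influence of that single edit.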